{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with Text Data and Naive Bayes in scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Agenda\n",
    "\n",
    "**Working with text data**\n",
    "\n",
    "- Representing text as data\n",
    "- Reading SMS data\n",
    "- Vectorizing SMS data\n",
    "- Examining the tokens and their counts\n",
    "- Bonus: Calculating the \"spamminess\" of each token\n",
    "\n",
    "**Naive Bayes classification**\n",
    "\n",
    "- Building a Naive Bayes model\n",
    "- Comparing Naive Bayes with logistic regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: Representing text as data\n",
    "\n",
    "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
    "\n",
    "> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.\n",
    "\n",
    "We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to \"convert text into a matrix of token counts\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# start with a simple example\n",
    "simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'cab', u'call', u'me', u'please', u'tonight', u'you']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# learn the 'vocabulary' of the training data\n",
    "vect = CountVectorizer()\n",
    "vect.fit(simple_train)\n",
    "vect.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3x6 sparse matrix of type '<type 'numpy.int64'>'\n",
       "\twith 9 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# transform training data into a 'document-term matrix'\n",
    "simple_train_dtm = vect.transform(simple_train)\n",
    "simple_train_dtm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  (0, 1)\t1\n",
      "  (0, 4)\t1\n",
      "  (0, 5)\t1\n",
      "  (1, 0)\t1\n",
      "  (1, 1)\t1\n",
      "  (1, 2)\t1\n",
      "  (2, 1)\t1\n",
      "  (2, 2)\t1\n",
      "  (2, 3)\t2\n"
     ]
    }
   ],
   "source": [
    "# print the sparse matrix\n",
    "print simple_train_dtm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 1, 0, 0, 1, 1],\n",
       "       [1, 1, 1, 0, 0, 0],\n",
       "       [0, 1, 1, 2, 0, 0]], dtype=int64)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# convert sparse matrix to a dense matrix\n",
    "simple_train_dtm.toarray()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cab</th>\n",
       "      <th>call</th>\n",
       "      <th>me</th>\n",
       "      <th>please</th>\n",
       "      <th>tonight</th>\n",
       "      <th>you</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cab  call  me  please  tonight  you\n",
       "0    0     1   0       0        1    1\n",
       "1    1     1   1       0        0    0\n",
       "2    0     1   1       2        0    0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the vocabulary and document-term matrix together\n",
    "import pandas as pd\n",
    "pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):\n",
    "\n",
    "> In this scheme, features and samples are defined as follows:\n",
    "\n",
    "> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.\n",
    "> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.\n",
    "\n",
    "> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.\n",
    "\n",
    "> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or \"Bag of n-grams\" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 1, 1, 1, 0, 0]], dtype=int64)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# transform testing data into a document-term matrix (using existing vocabulary)\n",
    "simple_test = [\"please don't call me\"]\n",
    "simple_test_dtm = vect.transform(simple_test)\n",
    "simple_test_dtm.toarray()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cab</th>\n",
       "      <th>call</th>\n",
       "      <th>me</th>\n",
       "      <th>please</th>\n",
       "      <th>tonight</th>\n",
       "      <th>you</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   cab  call  me  please  tonight  you\n",
       "0    0     1   1       1        0    0"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# examine the vocabulary and document-term matrix together\n",
    "pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Summary:**\n",
    "\n",
    "- `vect.fit(train)` learns the vocabulary of the training data\n",
    "- `vect.transform(train)` uses the fitted vocabulary to build a document-term matrix from the training data\n",
    "- `vect.transform(test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Reading SMS data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(5572, 2)\n"
     ]
    }
   ],
   "source": [
    "# read tab-separated file\n",
    "url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv'\n",
    "col_names = ['label', 'message']\n",
    "sms = pd.read_table(url, sep='\\t', header=None, names=col_names)\n",
    "print sms.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ham</td>\n",
       "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ham</td>\n",
       "      <td>Ok lar... Joking wif u oni...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>spam</td>\n",
       "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ham</td>\n",
       "      <td>U dun say so early hor... U c already then say...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ham</td>\n",
       "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>spam</td>\n",
       "      <td>FreeMsg Hey there darling it's been 3 week's n...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>ham</td>\n",
       "      <td>Even my brother is not like to speak with me. ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>ham</td>\n",
       "      <td>As per your request 'Melle Melle (Oru Minnamin...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>spam</td>\n",
       "      <td>WINNER!! As a valued network customer you have...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>spam</td>\n",
       "      <td>Had your mobile 11 months or more? U R entitle...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>ham</td>\n",
       "      <td>I'm gonna be home soon and i don't want to tal...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>spam</td>\n",
       "      <td>SIX chances to win CASH! From 100 to 20,000 po...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>spam</td>\n",
       "      <td>URGENT! You have won a 1 week FREE membership ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>ham</td>\n",
       "      <td>I've been searching for the right words to tha...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>ham</td>\n",
       "      <td>I HAVE A DATE ON SUNDAY WITH WILL!!</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>spam</td>\n",
       "      <td>XXXMobileMovieClub: To use your credit, click ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>ham</td>\n",
       "      <td>Oh k...i'm watching here:)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>ham</td>\n",
       "      <td>Eh u remember how 2 spell his name... Yes i di...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>ham</td>\n",
       "      <td>Fine if thats the way u feel. Thats the way ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>spam</td>\n",
       "      <td>England v Macedonia - dont miss the goals/team...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   label                                            message\n",
       "0    ham  Go until jurong point, crazy.. Available only ...\n",
       "1    ham                      Ok lar... Joking wif u oni...\n",
       "2   spam  Free entry in 2 a wkly comp to win FA Cup fina...\n",
       "3    ham  U dun say so early hor... U c already then say...\n",
       "4    ham  Nah I don't think he goes to usf, he lives aro...\n",
       "5   spam  FreeMsg Hey there darling it's been 3 week's n...\n",
       "6    ham  Even my brother is not like to speak with me. ...\n",
       "7    ham  As per your request 'Melle Melle (Oru Minnamin...\n",
       "8   spam  WINNER!! As a valued network customer you have...\n",
       "9   spam  Had your mobile 11 months or more? U R entitle...\n",
       "10   ham  I'm gonna be home soon and i don't want to tal...\n",
       "11  spam  SIX chances to win CASH! From 100 to 20,000 po...\n",
       "12  spam  URGENT! You have won a 1 week FREE membership ...\n",
       "13   ham  I've been searching for the right words to tha...\n",
       "14   ham                I HAVE A DATE ON SUNDAY WITH WILL!!\n",
       "15  spam  XXXMobileMovieClub: To use your credit, click ...\n",
       "16   ham                         Oh k...i'm watching here:)\n",
       "17   ham  Eh u remember how 2 spell his name... Yes i di...\n",
       "18   ham  Fine if thats the way u feel. Thats the way ...\n",
       "19  spam  England v Macedonia - dont miss the goals/team..."
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sms.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ham     4825\n",
       "spam     747\n",
       "dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sms.label.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# convert label to a numeric variable\n",
    "sms['label'] = sms.label.map({'ham':0, 'spam':1})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# define X and y\n",
    "X = sms.message\n",
    "y = sms.label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(4179L,)\n",
      "(1393L,)\n"
     ]
    }
   ],
   "source": [
    "# split into training and testing sets\n",
    "from sklearn.cross_validation import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n",
    "print X_train.shape\n",
    "print X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3: Vectorizing SMS data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# instantiate the vectorizer\n",
    "vect = CountVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<4179x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
       "\twith 55209 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# learn training data vocabulary, then create document-term matrix\n",
    "vect.fit(X_train)\n",
    "X_train_dtm = vect.transform(X_train)\n",
    "X_train_dtm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<4179x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
       "\twith 55209 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# alternative: combine fit and transform into a single step\n",
    "X_train_dtm = vect.fit_transform(X_train)\n",
    "X_train_dtm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<1393x7456 sparse matrix of type '<type 'numpy.int64'>'\n",
       "\twith 17604 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# transform testing data (using fitted vocabulary) into a document-term matrix\n",
    "X_test_dtm = vect.transform(X_test)\n",
    "X_test_dtm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 4: Examining the tokens and their counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# store token names\n",
    "X_train_tokens = vect.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[u'00', u'000', u'008704050406', u'0121', u'01223585236', u'01223585334', u'0125698789', u'02', u'0207', u'02072069400', u'02073162414', u'02085076972', u'021', u'03', u'04', u'0430', u'05', u'050703', u'0578', u'06', u'07', u'07008009200', u'07090201529', u'07090298926', u'07123456789', u'07732584351', u'07734396839', u'07742676969', u'0776xxxxxxx', u'07781482378', u'07786200117', u'078', u'07801543489', u'07808', u'07808247860', u'07808726822', u'07815296484', u'07821230901', u'07880867867', u'0789xxxxxxx', u'07946746291', u'0796xxxxxx', u'07973788240', u'07xxxxxxxxx', u'08', u'0800', u'08000407165', u'08000776320', u'08000839402', u'08000930705']\n"
     ]
    }
   ],
   "source": [
    "# first 50 tokens\n",
    "print X_train_tokens[:50]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[u'yer', u'yes', u'yest', u'yesterday', u'yet', u'yetunde', u'yijue', u'ym', u'ymca', u'yo', u'yoga', u'yogasana', u'yor', u'yorge', u'you', u'youdoing', u'youi', u'youphone', u'your', u'youre', u'yourjob', u'yours', u'yourself', u'youwanna', u'yowifes', u'yoyyooo', u'yr', u'yrs', u'ything', u'yummmm', u'yummy', u'yun', u'yunny', u'yuo', u'yuou', u'yup', u'zac', u'zaher', u'zealand', u'zebra', u'zed', u'zeros', u'zhong', u'zindgi', u'zoe', u'zoom', u'zouk', u'zyada', u'\\xe8n', u'\\u3028ud']\n"
     ]
    }
   ],
   "source": [
    "# last 50 tokens\n",
    "print X_train_tokens[-50:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       ..., \n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view X_train_dtm as a dense matrix\n",
    "X_train_dtm.toarray()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 5, 23,  2, ...,  1,  1,  1], dtype=int64)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count how many times EACH token appears across ALL messages in X_train_dtm\n",
    "import numpy as np\n",
    "X_train_counts = np.sum(X_train_dtm.toarray(), axis=0)\n",
    "X_train_counts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(7456L,)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train_counts.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "      <th>token</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3727</th>\n",
       "      <td>1</td>\n",
       "      <td>jules</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4172</th>\n",
       "      <td>1</td>\n",
       "      <td>mallika</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4169</th>\n",
       "      <td>1</td>\n",
       "      <td>malarky</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4165</th>\n",
       "      <td>1</td>\n",
       "      <td>makiing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4161</th>\n",
       "      <td>1</td>\n",
       "      <td>maintaining</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4158</th>\n",
       "      <td>1</td>\n",
       "      <td>mails</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4157</th>\n",
       "      <td>1</td>\n",
       "      <td>mailed</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4151</th>\n",
       "      <td>1</td>\n",
       "      <td>magicalsongs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4150</th>\n",
       "      <td>1</td>\n",
       "      <td>maggi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4149</th>\n",
       "      <td>1</td>\n",
       "      <td>magazine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4146</th>\n",
       "      <td>1</td>\n",
       "      <td>madodu</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4143</th>\n",
       "      <td>1</td>\n",
       "      <td>mad2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4142</th>\n",
       "      <td>1</td>\n",
       "      <td>mad1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4140</th>\n",
       "      <td>1</td>\n",
       "      <td>macs</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4139</th>\n",
       "      <td>1</td>\n",
       "      <td>macleran</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4138</th>\n",
       "      <td>1</td>\n",
       "      <td>mack</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4174</th>\n",
       "      <td>1</td>\n",
       "      <td>manage</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4175</th>\n",
       "      <td>1</td>\n",
       "      <td>manageable</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4178</th>\n",
       "      <td>1</td>\n",
       "      <td>manchester</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4179</th>\n",
       "      <td>1</td>\n",
       "      <td>manda</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4201</th>\n",
       "      <td>1</td>\n",
       "      <td>marking</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4200</th>\n",
       "      <td>1</td>\n",
       "      <td>marketing</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4197</th>\n",
       "      <td>1</td>\n",
       "      <td>marine</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4196</th>\n",
       "      <td>1</td>\n",
       "      <td>margin</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4193</th>\n",
       "      <td>1</td>\n",
       "      <td>marandratha</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4192</th>\n",
       "      <td>1</td>\n",
       "      <td>maraikara</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4191</th>\n",
       "      <td>1</td>\n",
       "      <td>maps</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4136</th>\n",
       "      <td>1</td>\n",
       "      <td>machi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4190</th>\n",
       "      <td>1</td>\n",
       "      <td>mapquest</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4187</th>\n",
       "      <td>1</td>\n",
       "      <td>manual</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2290</th>\n",
       "      <td>292</td>\n",
       "      <td>do</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7257</th>\n",
       "      <td>293</td>\n",
       "      <td>with</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7120</th>\n",
       "      <td>293</td>\n",
       "      <td>we</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6904</th>\n",
       "      <td>297</td>\n",
       "      <td>ur</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1081</th>\n",
       "      <td>298</td>\n",
       "      <td>at</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2995</th>\n",
       "      <td>299</td>\n",
       "      <td>get</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3465</th>\n",
       "      <td>302</td>\n",
       "      <td>if</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4778</th>\n",
       "      <td>306</td>\n",
       "      <td>or</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1522</th>\n",
       "      <td>332</td>\n",
       "      <td>but</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4647</th>\n",
       "      <td>338</td>\n",
       "      <td>not</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6017</th>\n",
       "      <td>344</td>\n",
       "      <td>so</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1574</th>\n",
       "      <td>349</td>\n",
       "      <td>can</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1016</th>\n",
       "      <td>358</td>\n",
       "      <td>are</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4662</th>\n",
       "      <td>361</td>\n",
       "      <td>now</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4743</th>\n",
       "      <td>390</td>\n",
       "      <td>on</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3235</th>\n",
       "      <td>416</td>\n",
       "      <td>have</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1552</th>\n",
       "      <td>443</td>\n",
       "      <td>call</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6539</th>\n",
       "      <td>453</td>\n",
       "      <td>that</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4704</th>\n",
       "      <td>460</td>\n",
       "      <td>of</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7424</th>\n",
       "      <td>508</td>\n",
       "      <td>your</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2821</th>\n",
       "      <td>518</td>\n",
       "      <td>for</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4489</th>\n",
       "      <td>550</td>\n",
       "      <td>my</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3623</th>\n",
       "      <td>568</td>\n",
       "      <td>it</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4238</th>\n",
       "      <td>601</td>\n",
       "      <td>me</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3612</th>\n",
       "      <td>679</td>\n",
       "      <td>is</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3502</th>\n",
       "      <td>683</td>\n",
       "      <td>in</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>929</th>\n",
       "      <td>717</td>\n",
       "      <td>and</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6542</th>\n",
       "      <td>1004</td>\n",
       "      <td>the</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7420</th>\n",
       "      <td>1660</td>\n",
       "      <td>you</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6656</th>\n",
       "      <td>1670</td>\n",
       "      <td>to</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>7456 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      count         token\n",
       "3727      1         jules\n",
       "4172      1       mallika\n",
       "4169      1       malarky\n",
       "4165      1       makiing\n",
       "4161      1   maintaining\n",
       "4158      1         mails\n",
       "4157      1        mailed\n",
       "4151      1  magicalsongs\n",
       "4150      1         maggi\n",
       "4149      1      magazine\n",
       "4146      1        madodu\n",
       "4143      1          mad2\n",
       "4142      1          mad1\n",
       "4140      1          macs\n",
       "4139      1      macleran\n",
       "4138      1          mack\n",
       "4174      1        manage\n",
       "4175      1    manageable\n",
       "4178      1    manchester\n",
       "4179      1         manda\n",
       "4201      1       marking\n",
       "4200      1     marketing\n",
       "4197      1        marine\n",
       "4196      1        margin\n",
       "4193      1   marandratha\n",
       "4192      1     maraikara\n",
       "4191      1          maps\n",
       "4136      1         machi\n",
       "4190      1      mapquest\n",
       "4187      1        manual\n",
       "...     ...           ...\n",
       "2290    292            do\n",
       "7257    293          with\n",
       "7120    293            we\n",
       "6904    297            ur\n",
       "1081    298            at\n",
       "2995    299           get\n",
       "3465    302            if\n",
       "4778    306            or\n",
       "1522    332           but\n",
       "4647    338           not\n",
       "6017    344            so\n",
       "1574    349           can\n",
       "1016    358           are\n",
       "4662    361           now\n",
       "4743    390            on\n",
       "3235    416          have\n",
       "1552    443          call\n",
       "6539    453          that\n",
       "4704    460            of\n",
       "7424    508          your\n",
       "2821    518           for\n",
       "4489    550            my\n",
       "3623    568            it\n",
       "4238    601            me\n",
       "3612    679            is\n",
       "3502    683            in\n",
       "929     717           and\n",
       "6542   1004           the\n",
       "7420   1660           you\n",
       "6656   1670            to\n",
       "\n",
       "[7456 rows x 2 columns]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create a DataFrame of tokens with their counts\n",
    "pd.DataFrame({'token':X_train_tokens, 'count':X_train_counts}).sort('count')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Bonus: Calculating the \"spamminess\" of each token"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create separate DataFrames for ham and spam\n",
    "sms_ham = sms[sms.label==0]\n",
    "sms_spam = sms[sms.label==1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# learn the vocabulary of ALL messages and save it\n",
    "vect.fit(sms.message)\n",
    "all_tokens = vect.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create document-term matrices for ham and spam\n",
    "ham_dtm = vect.transform(sms_ham.message)\n",
    "spam_dtm = vect.transform(sms_spam.message)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# count how many times EACH token appears across ALL ham messages\n",
    "ham_counts = np.sum(ham_dtm.toarray(), axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# count how many times EACH token appears across ALL spam messages\n",
    "spam_counts = np.sum(spam_dtm.toarray(), axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# create a DataFrame of tokens with their separate ham and spam counts\n",
    "token_counts = pd.DataFrame({'token':all_tokens, 'ham':ham_counts, 'spam':spam_counts})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# add one to ham and spam counts to avoid dividing by zero (in the step that follows)\n",
    "token_counts['ham'] = token_counts.ham + 1\n",
    "token_counts['spam'] = token_counts.spam + 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ham</th>\n",
       "      <th>spam</th>\n",
       "      <th>token</th>\n",
       "      <th>spam_ratio</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3684</th>\n",
       "      <td>319</td>\n",
       "      <td>1</td>\n",
       "      <td>gt</td>\n",
       "      <td>0.003135</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4793</th>\n",
       "      <td>317</td>\n",
       "      <td>1</td>\n",
       "      <td>lt</td>\n",
       "      <td>0.003155</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3805</th>\n",
       "      <td>232</td>\n",
       "      <td>1</td>\n",
       "      <td>he</td>\n",
       "      <td>0.004310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6843</th>\n",
       "      <td>168</td>\n",
       "      <td>1</td>\n",
       "      <td>she</td>\n",
       "      <td>0.005952</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4747</th>\n",
       "      <td>163</td>\n",
       "      <td>1</td>\n",
       "      <td>lor</td>\n",
       "      <td>0.006135</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2428</th>\n",
       "      <td>151</td>\n",
       "      <td>1</td>\n",
       "      <td>da</td>\n",
       "      <td>0.006623</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4550</th>\n",
       "      <td>136</td>\n",
       "      <td>1</td>\n",
       "      <td>later</td>\n",
       "      <td>0.007353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1247</th>\n",
       "      <td>90</td>\n",
       "      <td>1</td>\n",
       "      <td>ask</td>\n",
       "      <td>0.011111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6626</th>\n",
       "      <td>90</td>\n",
       "      <td>1</td>\n",
       "      <td>said</td>\n",
       "      <td>0.011111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2714</th>\n",
       "      <td>89</td>\n",
       "      <td>1</td>\n",
       "      <td>doing</td>\n",
       "      <td>0.011236</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1084</th>\n",
       "      <td>89</td>\n",
       "      <td>1</td>\n",
       "      <td>amp</td>\n",
       "      <td>0.011236</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5167</th>\n",
       "      <td>80</td>\n",
       "      <td>1</td>\n",
       "      <td>morning</td>\n",
       "      <td>0.012500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2163</th>\n",
       "      <td>231</td>\n",
       "      <td>3</td>\n",
       "      <td>come</td>\n",
       "      <td>0.012987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1142</th>\n",
       "      <td>77</td>\n",
       "      <td>1</td>\n",
       "      <td>anything</td>\n",
       "      <td>0.012987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2289</th>\n",
       "      <td>77</td>\n",
       "      <td>1</td>\n",
       "      <td>cos</td>\n",
       "      <td>0.012987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4724</th>\n",
       "      <td>75</td>\n",
       "      <td>1</td>\n",
       "      <td>lol</td>\n",
       "      <td>0.013333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7463</th>\n",
       "      <td>72</td>\n",
       "      <td>1</td>\n",
       "      <td>sure</td>\n",
       "      <td>0.013889</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7099</th>\n",
       "      <td>70</td>\n",
       "      <td>1</td>\n",
       "      <td>something</td>\n",
       "      <td>0.014286</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3690</th>\n",
       "      <td>68</td>\n",
       "      <td>1</td>\n",
       "      <td>gud</td>\n",
       "      <td>0.014706</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3171</th>\n",
       "      <td>63</td>\n",
       "      <td>1</td>\n",
       "      <td>feel</td>\n",
       "      <td>0.015873</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8394</th>\n",
       "      <td>63</td>\n",
       "      <td>1</td>\n",
       "      <td>went</td>\n",
       "      <td>0.015873</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5371</th>\n",
       "      <td>63</td>\n",
       "      <td>1</td>\n",
       "      <td>nice</td>\n",
       "      <td>0.015873</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3595</th>\n",
       "      <td>59</td>\n",
       "      <td>1</td>\n",
       "      <td>gonna</td>\n",
       "      <td>0.016949</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7001</th>\n",
       "      <td>59</td>\n",
       "      <td>1</td>\n",
       "      <td>sleep</td>\n",
       "      <td>0.016949</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1064</th>\n",
       "      <td>59</td>\n",
       "      <td>1</td>\n",
       "      <td>always</td>\n",
       "      <td>0.016949</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5254</th>\n",
       "      <td>755</td>\n",
       "      <td>13</td>\n",
       "      <td>my</td>\n",
       "      <td>0.017219</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5217</th>\n",
       "      <td>116</td>\n",
       "      <td>2</td>\n",
       "      <td>much</td>\n",
       "      <td>0.017241</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5533</th>\n",
       "      <td>115</td>\n",
       "      <td>2</td>\n",
       "      <td>oh</td>\n",
       "      <td>0.017391</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2815</th>\n",
       "      <td>56</td>\n",
       "      <td>1</td>\n",
       "      <td>dun</td>\n",
       "      <td>0.017857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3925</th>\n",
       "      <td>166</td>\n",
       "      <td>3</td>\n",
       "      <td>home</td>\n",
       "      <td>0.018072</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1623</th>\n",
       "      <td>1</td>\n",
       "      <td>22</td>\n",
       "      <td>bonus</td>\n",
       "      <td>22.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3994</th>\n",
       "      <td>1</td>\n",
       "      <td>22</td>\n",
       "      <td>http</td>\n",
       "      <td>22.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6619</th>\n",
       "      <td>1</td>\n",
       "      <td>22</td>\n",
       "      <td>sae</td>\n",
       "      <td>22.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>735</th>\n",
       "      <td>1</td>\n",
       "      <td>22</td>\n",
       "      <td>8007</td>\n",
       "      <td>22.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>732</th>\n",
       "      <td>1</td>\n",
       "      <td>23</td>\n",
       "      <td>800</td>\n",
       "      <td>23.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5297</th>\n",
       "      <td>1</td>\n",
       "      <td>23</td>\n",
       "      <td>national</td>\n",
       "      <td>23.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8375</th>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>weekly</td>\n",
       "      <td>25.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8153</th>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>valid</td>\n",
       "      <td>25.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>309</th>\n",
       "      <td>1</td>\n",
       "      <td>25</td>\n",
       "      <td>10p</td>\n",
       "      <td>25.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>618</th>\n",
       "      <td>1</td>\n",
       "      <td>26</td>\n",
       "      <td>5000</td>\n",
       "      <td>26.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5117</th>\n",
       "      <td>1</td>\n",
       "      <td>26</td>\n",
       "      <td>mob</td>\n",
       "      <td>26.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>364</th>\n",
       "      <td>2</td>\n",
       "      <td>54</td>\n",
       "      <td>16</td>\n",
       "      <td>27.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2150</th>\n",
       "      <td>1</td>\n",
       "      <td>27</td>\n",
       "      <td>collection</td>\n",
       "      <td>27.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7838</th>\n",
       "      <td>1</td>\n",
       "      <td>27</td>\n",
       "      <td>tones</td>\n",
       "      <td>27.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2963</th>\n",
       "      <td>1</td>\n",
       "      <td>27</td>\n",
       "      <td>entry</td>\n",
       "      <td>27.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>30</td>\n",
       "      <td>000</td>\n",
       "      <td>30.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8596</th>\n",
       "      <td>3</td>\n",
       "      <td>99</td>\n",
       "      <td>www</td>\n",
       "      <td>33.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6525</th>\n",
       "      <td>1</td>\n",
       "      <td>33</td>\n",
       "      <td>ringtone</td>\n",
       "      <td>33.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>356</th>\n",
       "      <td>1</td>\n",
       "      <td>35</td>\n",
       "      <td>150ppm</td>\n",
       "      <td>35.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8016</th>\n",
       "      <td>2</td>\n",
       "      <td>75</td>\n",
       "      <td>uk</td>\n",
       "      <td>37.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1333</th>\n",
       "      <td>1</td>\n",
       "      <td>39</td>\n",
       "      <td>awarded</td>\n",
       "      <td>39.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>299</th>\n",
       "      <td>1</td>\n",
       "      <td>42</td>\n",
       "      <td>1000</td>\n",
       "      <td>42.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>617</th>\n",
       "      <td>1</td>\n",
       "      <td>45</td>\n",
       "      <td>500</td>\n",
       "      <td>45.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2371</th>\n",
       "      <td>1</td>\n",
       "      <td>45</td>\n",
       "      <td>cs</td>\n",
       "      <td>45.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3688</th>\n",
       "      <td>1</td>\n",
       "      <td>51</td>\n",
       "      <td>guaranteed</td>\n",
       "      <td>51.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>369</th>\n",
       "      <td>1</td>\n",
       "      <td>52</td>\n",
       "      <td>18</td>\n",
       "      <td>52.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7837</th>\n",
       "      <td>1</td>\n",
       "      <td>61</td>\n",
       "      <td>tone</td>\n",
       "      <td>61.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>352</th>\n",
       "      <td>1</td>\n",
       "      <td>72</td>\n",
       "      <td>150p</td>\n",
       "      <td>72.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6113</th>\n",
       "      <td>1</td>\n",
       "      <td>94</td>\n",
       "      <td>prize</td>\n",
       "      <td>94.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2067</th>\n",
       "      <td>1</td>\n",
       "      <td>114</td>\n",
       "      <td>claim</td>\n",
       "      <td>114.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>8713 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      ham  spam       token  spam_ratio\n",
       "3684  319     1          gt    0.003135\n",
       "4793  317     1          lt    0.003155\n",
       "3805  232     1          he    0.004310\n",
       "6843  168     1         she    0.005952\n",
       "4747  163     1         lor    0.006135\n",
       "2428  151     1          da    0.006623\n",
       "4550  136     1       later    0.007353\n",
       "1247   90     1         ask    0.011111\n",
       "6626   90     1        said    0.011111\n",
       "2714   89     1       doing    0.011236\n",
       "1084   89     1         amp    0.011236\n",
       "5167   80     1     morning    0.012500\n",
       "2163  231     3        come    0.012987\n",
       "1142   77     1    anything    0.012987\n",
       "2289   77     1         cos    0.012987\n",
       "4724   75     1         lol    0.013333\n",
       "7463   72     1        sure    0.013889\n",
       "7099   70     1   something    0.014286\n",
       "3690   68     1         gud    0.014706\n",
       "3171   63     1        feel    0.015873\n",
       "8394   63     1        went    0.015873\n",
       "5371   63     1        nice    0.015873\n",
       "3595   59     1       gonna    0.016949\n",
       "7001   59     1       sleep    0.016949\n",
       "1064   59     1      always    0.016949\n",
       "5254  755    13          my    0.017219\n",
       "5217  116     2        much    0.017241\n",
       "5533  115     2          oh    0.017391\n",
       "2815   56     1         dun    0.017857\n",
       "3925  166     3        home    0.018072\n",
       "...   ...   ...         ...         ...\n",
       "1623    1    22       bonus   22.000000\n",
       "3994    1    22        http   22.000000\n",
       "6619    1    22         sae   22.000000\n",
       "735     1    22        8007   22.000000\n",
       "732     1    23         800   23.000000\n",
       "5297    1    23    national   23.000000\n",
       "8375    1    25      weekly   25.000000\n",
       "8153    1    25       valid   25.000000\n",
       "309     1    25         10p   25.000000\n",
       "618     1    26        5000   26.000000\n",
       "5117    1    26         mob   26.000000\n",
       "364     2    54          16   27.000000\n",
       "2150    1    27  collection   27.000000\n",
       "7838    1    27       tones   27.000000\n",
       "2963    1    27       entry   27.000000\n",
       "1       1    30         000   30.000000\n",
       "8596    3    99         www   33.000000\n",
       "6525    1    33    ringtone   33.000000\n",
       "356     1    35      150ppm   35.000000\n",
       "8016    2    75          uk   37.500000\n",
       "1333    1    39     awarded   39.000000\n",
       "299     1    42        1000   42.000000\n",
       "617     1    45         500   45.000000\n",
       "2371    1    45          cs   45.000000\n",
       "3688    1    51  guaranteed   51.000000\n",
       "369     1    52          18   52.000000\n",
       "7837    1    61        tone   61.000000\n",
       "352     1    72        150p   72.000000\n",
       "6113    1    94       prize   94.000000\n",
       "2067    1   114       claim  114.000000\n",
       "\n",
       "[8713 rows x 4 columns]"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate ratio of spam-to-ham for each token\n",
    "token_counts['spam_ratio'] = token_counts.spam / token_counts.ham\n",
    "token_counts.sort('spam_ratio')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 5: Building a Naive Bayes model\n",
    "\n",
    "We will use [Multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):\n",
    "\n",
    "> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train a Naive Bayes model using X_train_dtm\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "nb = MultinomialNB()\n",
    "nb.fit(X_train_dtm, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# make class predictions for X_test_dtm\n",
    "y_pred_class = nb.predict(X_test_dtm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.988513998564\n"
     ]
    }
   ],
   "source": [
    "# calculate accuracy of class predictions\n",
    "from sklearn import metrics\n",
    "print metrics.accuracy_score(y_test, y_pred_class)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1203    5]\n",
      " [  11  174]]\n"
     ]
    }
   ],
   "source": [
    "# confusion matrix\n",
    "print metrics.confusion_matrix(y_test, y_pred_class)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,\n",
       "         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# predict (poorly calibrated) probabilities\n",
    "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]\n",
    "y_pred_prob"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.986643100054\n"
     ]
    }
   ],
   "source": [
    "# calculate AUC\n",
    "print metrics.roc_auc_score(y_test, y_pred_prob)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "574               Waiting for your call.\n",
       "3375             Also andros ice etc etc\n",
       "45      No calls..messages..missed calls\n",
       "3415             No pic. Please re-send.\n",
       "1988    No calls..messages..missed calls\n",
       "Name: message, dtype: object"
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print message text for the false positives\n",
    "X_test[y_test < y_pred_class]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3132    LookAtMe!: Thanks for your purchase of a video...\n",
       "5       FreeMsg Hey there darling it's been 3 week's n...\n",
       "3530    Xmas & New Years Eve tickets are now on sale f...\n",
       "684     Hi I'm sue. I am 20 years old and work as a la...\n",
       "1875    Would you like to see my XXX pics they are so ...\n",
       "1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...\n",
       "4298    thesmszone.com lets you send free anonymous an...\n",
       "4949    Hi this is Amy, we will be sending you a free ...\n",
       "2821    INTERFLORA - It's not too late to order Inter...\n",
       "2247    Hi ya babe x u 4goten bout me?' scammers getti...\n",
       "4514    Money i have won wining number 946 wot do i do...\n",
       "Name: message, dtype: object"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# print message text for the false negatives\n",
    "X_test[y_test > y_pred_class]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323.\""
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# what do you notice about the false negatives?\n",
    "X_test[3132]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 6: Comparing Naive Bayes with logistic regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LogisticRegression(C=1000000000.0, class_weight=None, dual=False,\n",
       "          fit_intercept=True, intercept_scaling=1, max_iter=100,\n",
       "          multi_class='ovr', penalty='l2', random_state=None,\n",
       "          solver='liblinear', tol=0.0001, verbose=0)"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# import/instantiate/fit\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "logreg = LogisticRegression(C=1e9)\n",
    "logreg.fit(X_train_dtm, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# class predictions and predicted probabilities\n",
    "y_pred_class = logreg.predict(X_test_dtm)\n",
    "y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.989231873654\n",
      "0.994144889923\n"
     ]
    }
   ],
   "source": [
    "# calculate accuracy and AUC\n",
    "print metrics.accuracy_score(y_test, y_pred_class)\n",
    "print metrics.roc_auc_score(y_test, y_pred_prob)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
